Pan-Cancer Copy Number Signature Identification and Analysis

Shixiang Wang, Ziyu Tao, Tao Wu, Huimin Li, Xue-Song Liu (Corresponding author)

2020-12-16


This document is compiled from an RMarkdown file which contains all code or description necessary to reproduce the analysis for the accompanying project. Each section below describes a different component of the analysis and most of numbers and figures are generated directly from the underlying data on compilation.

LICENSE

If you want to reuse the code in this report, please note the license below should be followed.

The code is made available for non commercial research purposes only under the MIT. However, notwithstanding any provision of the MIT License, the software currently may not be used for commercial purposes without explicit written permission after contacting Shixiang Wang or Xue-Song Liu .

PART 0: Data preprocessing

In this part, raw data are collected from databases or papers and the core pre-processing steps are described in sections below.

The pre-processing work has been done by setting root path of the project repository as work directory. Therefore, keep in mind the work directory should be properly set if you are interested in reproducing the pre-processing procedure.

Prepare PCAWG datasets

We download PCAWG phenotype and copy number data from UCSC Xena and save them to local machine with R format.

dir.create("Xena")

# Phenotype (Specimen Centric) -----------------------------------------------

library(UCSCXenaTools)
pcawg_phenotype <- XenaData %>%
  dplyr::filter(XenaHostNames == "pcawgHub") %>%
  XenaScan("phenotype") %>%
  XenaScan("specimen centric") %>%
  XenaGenerate() %>%
  XenaQuery()
pcawg_phenotype <- pcawg_phenotype %>%
  XenaDownload(destdir = "Xena", trans_slash = TRUE)

phenotype_list <- XenaPrepare(pcawg_phenotype)
pcawg_samp_info_sp <- phenotype_list[1:4]

saveRDS(pcawg_samp_info_sp, file = "data/pcawg_samp_info_sp.rds")

# PCAWG (Specimen Centric) ------------------------------------------------

download.file(
  "https://pcawg.xenahubs.net/download/20170119_final_consensus_copynumber_sp.gz",
  "Xena/pcawg_copynumber_sp.gz"
)

pcawg_cn <- data.table::fread("Xena/pcawg_copynumber_sp.gz")
pcawg_samp_info_sp <- readRDS("data/pcawg_samp_info_sp.rds")

sex_dt <- pcawg_samp_info_sp$pcawg_donor_clinical_August2016_v9_sp %>%
  dplyr::select(xena_sample, donor_sex) %>%
  purrr::set_names(c("sample", "sex")) %>%
  data.table::as.data.table()

saveRDS(sex_dt, file = "data/pcawg_sex_sp.rds")

pcawg_cn <- pcawg_cn[!is.na(total_cn)]
pcawg_cn$value <- NULL
pcawg_cn <- pcawg_cn[, c(1:5, 7)]
colnames(pcawg_cn)[1:5] <- c("sample", "Chromosome", "Start.bp", "End.bp", "modal_cn")

saveRDS(pcawg_cn, file = "data/pcawg_copynumber_sp.rds")

Generate PCAWG sample-by-component matrix

We then read the copy number data and transform it into a CopyNumber object with R package sigminer. Then sig_tally() is used to generate the matrix for NMF decomposition.

library(sigminer)

# Only focus autosomes, suggested by Prof. Liu
pcawg_cn <- readRDS("data/pcawg_copynumber_sp.rds")
table(pcawg_cn$Chromosome)
pcawg_cn <- pcawg_cn[!Chromosome %in% c("X", "Y")]
cn_obj <- read_copynumber(pcawg_cn,
  add_loh = TRUE,
  loh_min_len = 1e4,
  loh_min_frac = 0.05,
  max_copynumber = 1000L,
  genome_build = "hg19",
  complement = FALSE,
  genome_measure = "called"
)
saveRDS(cn_obj, file = "data/pcawg_cn_obj.rds")

# Tally -------------------------------------------------------------------
library(sigminer)
cn_obj <- readRDS("data/pcawg_cn_obj.rds")
table(cn_obj@data$chromosome)

tally_X <- sig_tally(cn_obj,
  method = "X",
  add_loh = TRUE,
  cores = 10
)
saveRDS(tally_X, file = "data/pcawg_cn_tally_X.rds")
str(tally_X$all_matrices, max.level = 1)

p <- show_catalogue(tally_X,
  mode = "copynumber", method = "X",
  style = "cosmic", y_tr = function(x) log10(x + 1),
  y_lab = "log10(count +1)"
)
p
ggplot2::ggsave("output/pcawg_catalogs_tally_X.pdf", plot = p, width = 16, height = 2.5)

head(sort(colSums(tally_X$nmf_matrix)), n = 50)

## classes without LOH labels
tally_X_noLOH <- sig_tally(cn_obj,
  method = "X",
  add_loh = FALSE,
  cores = 10
)
saveRDS(tally_X_noLOH$all_matrices, file = "data/pcawg_cn_tally_X_noLOH.rds")
str(tally_X_noLOH$all_matrices, max.level = 1)

p <- show_catalogue(tally_X_noLOH,
  mode = "copynumber", method = "X",
  style = "cosmic", y_tr = function(x) log10(x + 1),
  y_lab = "log10(count +1)"
)
p
ggplot2::ggsave("output/pcawg_catalogs_tally_X_noLOH.pdf", plot = p, width = 16, height = 2.5)

The generated catalog profiles for LOH version and non-LOH version component classification can be viewed by the following link:

Prepare TCGA datasets

TCGA allele-specific copy number data are downloaded from GDC portal and transformed into R format by Huimin Li.

The data is go further checked and cleaned by Shixiang.

# Huimin collected TCGA data from GDC portal
# and generate file with format needed by sigminer
# Here I will firstly further clean up the data

library(data.table)
x <- readRDS("data/TCGA/datamicopy.rds")
setDT(x)

colnames(x)[6] <- "minor_cn"
head(x)
x[, sample := substr(sample, 1, 15)]
saveRDS(x, file = "data/TCGA/tcga_cn.rds")

rm(list = ls())

TCGA phenotype data is downloaded from UCSC Xena.

download.file(
  "https://pancanatlas.xenahubs.net/download/Survival_SupplementalTable_S1_20171025_xena_sp.gz",
  "Xena/Survival_SupplementalTable_S1_20171025_xena_sp.gz"
)

tcga_cli <- data.table::fread("Xena/Survival_SupplementalTable_S1_20171025_xena_sp.gz")
table(tcga_cli$`cancer type abbreviation`)
saveRDS(tcga_cli, file = "data/TCGA/tcga_cli.rds")

Generate TCGA sample-by-component matrix

Similar to PCAWG, TCGA matrices are also generated.

library(sigminer)

# Only focus autosomes
tcga_cn <- readRDS("data/TCGA/tcga_cn.rds")
table(tcga_cn$chromosome)
tcga_cn <- tcga_cn[!chromosome %in% c("chrX", "chrY")]
cn_obj <- read_copynumber(
  tcga_cn,
  seg_cols = c("chromosome", "start", "end", "segVal"),
  add_loh = TRUE,
  loh_min_len = 1e4,
  loh_min_frac = 0.05,
  max_copynumber = 1000L,
  genome_build = "hg38",
  complement = FALSE,
  genome_measure = "called"
)
saveRDS(cn_obj, file = "data/TCGA/tcga_cn_obj.rds")

# Tally step
# Generate the counting matrices
# Same as what have done for PCAWG dataset
library(sigminer)
cn_obj <- readRDS("data/tcga_cn_obj.rds")
table(cn_obj@data$chromosome)

tally_X <- sig_tally(cn_obj,
  method = "X",
  add_loh = TRUE,
  cores = 10
)
saveRDS(tally_X, file = "data/TCGA/tcga_cn_tally_X.rds")


p <- show_catalogue(tally_X,
  mode = "copynumber", method = "X",
  style = "cosmic", y_tr = function(x) log10(x + 1),
  y_lab = "log10(count +1)"
)
p
ggplot2::ggsave("output/tcga_catalogs_tally_X.pdf", plot = p, width = 16, height = 2.5)

head(sort(colSums(tally_X$nmf_matrix)), n = 50)

## classes without LOH labels
tally_X_noLOH <- sig_tally(cn_obj,
  method = "X",
  add_loh = FALSE,
  cores = 10
)
saveRDS(tally_X_noLOH$all_matrices, file = "data/TCGA/tcga_cn_tally_X_noLOH.rds")
str(tally_X_noLOH$all_matrices, max.level = 1)

p <- show_catalogue(tally_X_noLOH,
  mode = "copynumber", method = "X",
  style = "cosmic", y_tr = function(x) log10(x + 1),
  y_lab = "log10(count +1)"
)
p
ggplot2::ggsave("output/tcga_catalogs_tally_X_noLOH.pdf", plot = p, width = 16, height = 2.5)

The generated catalog profiles for LOH version and non-LOH version component classification can be viewed by the following link:

Other PCAWG genotype/phenotype data

Most of other PCAWG genotype/phenotype data used in this study are obtained from PCAWG paper collection published in early 2020.

TODO: describe detailed datasets.

PART 1: Copy number signature identification and activity attribution

In this part, how the copy number signatures are identified and how the corresponding signature activities (a.k.a. exposures) are attributed are described in sections below. The 176 component classification for PCAWG data is our main focus.

The work of this part has been done by either setting root path of the project repository as work directory or moving data and code to the same path (see below). Therefore, keep in mind the work directory/data path should be properly set if you are interested in reproducing the analysis procedure.

The signature identification work cannot be done in local machine because high intensity calculations are required, we used HPC of ShanghaiTech University to submit computation jobs to HPC clusters. Input data files, R scripts and PBS job scripts are stored in a same path, the structure can be viewed as:

.
├── 04-call-tcga-signatures-with-BP-types.R
├── 05-call-pcawg-signatures-with-BP-types.R
├── bp_pcawg.pbs
├── bp_pcawg_types.pbs
├── bp_tcga.pbs
├── bp_tcga_types.pbs
├── call_pcawg_bp.R
├── call_pcawg_sp.R
├── call_tcga_bp.R
├── call_tcga.R
├── pcawg_cn_tally_X.rds
├── pcawg.pbs
├── tcga_cn_tally_X.rds
├── tcga.pbs

Copy number signature identification

The signature identification pipeline is adopted from Degasperi et al. (2020), we implemented the procedure in our package sigminer by following the method description. A benchmark has been done to make sure our implementation works properly.

Identification pipeline

To make the signature identification procedure more clear, we drew a flowchart. We name this pipeline “Best Practice” and use its short name BP (or bp).

TODO: add flowchart

Apply to PCAWG catalog matrix

PBS file bp_pcawg.pbs content:

#PBS -l walltime=500:00:00
#PBS -l nodes=1:ppn=50
#PBS -S /bin/bash
#PBS -l mem=300gb
#PBS -j oe
#PBS -M w_shixiang@163.com
#PBS -q slst_pub

# Please set PBS arguments above
cd /public/home/wangshx/wangshx/PCAWG-TCGA
module load apps/R/3.6.1

# Following are commands/scripts you want to run
Rscript call_pcawg_bp.R

R script call_pcawg_bp.R content:

library(sigminer)

tally_X <- readRDS("pcawg_cn_tally_X.rds")
tally_X_noLOH <- readRDS("pcawg_cn_tally_X_noLOH.rds")

sigs <- bp_extract_signatures(
  tally_X$nmf_matrix,
  range = 2:30,
  n_bootstrap = 10,
  n_nmf_run = 10,
  cores = 50,
  cache_dir = "pcawg_bp_pynmf",
  keep_cache = TRUE,
  cores_solution = 30,
  pynmf = TRUE,
  use_conda = TRUE
)

saveRDS(sigs, file = "pcawg_cn_sigs_CN176_BP.rds")

Signature number determination

The code above call 100 NMF runs (10 bootstrap catalogs and 10 NMF runs for each bootstrap catalog) for each signature number. Only 1000 NMF runs (20 bootstrap catalogs and 50 NMF runs for each bootstrap catalog) was applied for 176-component classification approach, which has the same arguments proposed by Degasperi et al. (2020).

First, load the results.

library(sigminer)
solution1000 <- readRDS("../BP/BP_PCAWG_1000_Extraction_Result.rds")
solution100 <- readRDS("../data/pcawg_cn_sigs_CN176_BP.rds")

Next we compare some important measures for signature number determination. We can see 100 NMF runs and 1000 NMF runs can achieve very similar results.

bp_show_survey(solution1000, add_score = F, fixed_ratio = F)

bp_show_survey(solution100, add_score = F, fixed_ratio = F)

Basically, the 5 measures can be classified into 3 categories:

  1. silhouette and similarity: these two scores indicate the result stability. The signature number with value decreases sharply is preferred.
  2. distance and error: these two scores indicate the difference between raw matrix and reconstructed matrix for signatures. The signature number with lower value is better.
  3. pos cor: this is constructed by Prof. Liu and Shixiang to see the average positive correlation between signatures. A signature number should be considered if it has lower value. In practice, we find this measure has referential value sometimes.

More see ?bp_show_survey in R console for how they are calculated.

In previous study, Degasperi et al. (2020) used silhouette and error for signature number determination, PCAWG Mutational Signatures Working Group et al. (2020) used similarity and distance for signature number determination. For plots above we can see that they very close.

From plots above, we observe that the stability drops sharply after 11, so here 11 signatures is selected for this PCAWG dataset.

Similarity between two results with different parameter setting

We have two results above, so we can compare the signatures from two different parameter setting by calculating their pairwise cosine similarity.

sim <- get_sig_similarity(solution1000$object$K11, solution100$object$K11)
ComplexHeatmap::pheatmap(sim$similarity, cluster_rows = FALSE, cluster_cols = FALSE, display_numbers = TRUE)

We can know that the results contain same signature profile! This practice indicates that 100 NMF runs for each signature number is enough.

The Signature object with 11 signature profiles are stored for further analysis.

# Use root directory as work directory
saveRDS(solution1000$object$K11, file = "data/pcawg_cn_sigs_CN176_signature.rds")

Reliable signature activity attribution

After signature determination, the reliable activities (a.k.a. exposures) of signatures are attributed to each sample for downstream analysis.

Attribution pipeline

There are 3 core steps in this pipeline:

  1. Step1: We run n = 100 optimizations on bootstrapped data and obtain a distribution of activities for each sample.
  2. Step2: We consider the median activities as point estimates.
  3. step3: In each sample, we reset a signature activity to 0 if its activity distribution below the threshold of 1% of total mutations in the sample with one-side T test p value >0.05.

This pipeline is similar to previously reported Huang, Wojtowicz, and Przytycka (2018). However, there are some different settings:

  1. We use 1% of total mutations as a threshold here instead 5% in Degasperi et al. (2020), because we observed some copy number segment contributions are removed from results with 5% but they can be used for better explanation of sample copy number profile reconstruction.
  2. We use T test for P value calculation here instead of empirical P value calculation based on proportion in Degasperi et al. (2020) for keeping more positive events.

TODO: add flowchart

Apply to PCAWG catalog matrix and signatures

We apply the attribution pipeline above.

# Use root directory as work directory
library(sigminer)
tally_X <- readRDS("data/pcawg_cn_tally_X.rds")
pcawg_types <- readRDS("data/pcawg_type_info.rds")

sc <- pcawg_types$cancer_type
names(sc) <- pcawg_types$sample

pcawg_expo <- bp_attribute_activity(
  solution1000$object$K11,
  sample_class = sc,
  nmf_matrix = tally_X$nmf_matrix,
  return_class = "data.table"
)

saveRDS(pcawg_expo, file = "data/pcawg_cn_sigs_CN176_activity.rds")

Now we can know how well each sample catalog profile can be constructed from the signature combination.

pcawg_expo <- readRDS("../data/pcawg_cn_sigs_CN176_activity.rds")
summary(pcawg_expo$similarity)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4446  0.8431  0.9043  0.8950  0.9649  0.9999 
hist(pcawg_expo$similarity, breaks = 100, xlab = "Reconstructed similarity", main = NA)

Most of samples are well constructed. Next we will use a similarity threshold 0.75 to filter out some samples for removing their effects on the following analysis.

TCGA copy number signatures

To validate the copy number signatures discovered in PCAWG database, we need another database for validation. TCGA is the best resource we got and similar processing and signature identification are applied.

library(sigminer)

tally_X <- readRDS("tcga_cn_tally_X.rds")
tally_X_noLOH <- readRDS("tcga_cn_tally_X_noLOH.rds")

sigs <- bp_extract_signatures(
  tally_X$nmf_matrix,
  range = 2:30,
  cores = 50,
  n_bootstrap = 10,
  n_nmf_run = 10,
  cache_dir = "tcga_bp_pynmf",
  keep_cache = TRUE,
  cores_solution = 30,
  pynmf = TRUE,
  use_conda = TRUE
)

saveRDS(sigs, file = "tcga_cn_sigs_CN176_BP.rds")

Load the solution list:

tcga_solutions <- readRDS("../data/TCGA/tcga_cn_sigs_CN176_BP.rds")
bp_show_survey(tcga_solutions, add_score = FALSE, fixed_ratio = FALSE, )

Also 11 signatures are determined based on survey plot.

Similarly, we apply the attribution pipeline to TCGA.

library(sigminer)

tcga_solutions <- readRDS("data/TCGA/tcga_cn_sigs_CN176_BP.rds")

tcga_expo <- bp_attribute_activity(
  tcga_solutions$object$K11,
  sample_class = sc,
  nmf_matrix = tally_X$nmf_matrix,
  return_class = "data.table"
)

saveRDS(tcga_expo, file = "data/TCGA/tcga_cn_sigs_CN176_activity.rds")
tcga_expo <- readRDS("../data/TCGA/tcga_cn_sigs_CN176_activity.rds")
summary(tcga_expo$similarity)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2697  0.8464  0.8994  0.8878  0.9410  0.9933 
hist(tcga_expo$similarity, breaks = 100, xlab = "Reconstructed similarity", main = NA)

PART 2: Signature profile visualization and analysis

In this part, we would like to visualize and analyze all tumor samples as a whole.

Signature profile visualization

library(tidyverse)
library(sigminer)

pcawg_sigs <- readRDS(file = "../data/pcawg_cn_sigs_CN176_signature.rds")

Show signature profile with relative contribution value for each component.

show_sig_profile(
  pcawg_sigs,
  mode = "copynumber",
  method = "X",
  style = "cosmic"
)

Show signature profile with absolute contribution value (estimated segment count) for each component.

show_sig_profile(
  pcawg_sigs,
  mode = "copynumber",
  method = "X",
  style = "cosmic",
  normalize = "raw"
)

Show signature activity profile.

show_sig_exposure(
  pcawg_sigs,
  style = "cosmic",
  rm_space = TRUE,
  palette = sigminer:::letter_colors
)

Signature analysis

Signature similarity

Here we explore the similarity between copy number signatures identified in PCAWG data. High cosine similarity will cause some issues in signature assignment and activity attribution (Maura et al. 2019).

sim <- get_sig_similarity(pcawg_sigs, pcawg_sigs)
pheatmap::pheatmap(sim$similarity, cluster_cols = F, cluster_rows = F, display_numbers = TRUE)

Most of signature-pairs have similarity around 0.1.

Signature activity distribution

Transform data to long format.

df_abs <- get_sig_exposure(pcawg_sigs) %>%
  tidyr::pivot_longer(cols = starts_with("Sig"), names_to = "sig", values_to = "activity") %>%
  dplyr::mutate(
    activity = log10(activity + 1),
    sig = factor(sig, levels = paste0("Sig", 1:11))
  )

df_rel <- get_sig_exposure(pcawg_sigs, type = "relative", rel_threshold = 0) %>%
  tidyr::pivot_longer(cols = starts_with("Sig"), names_to = "sig", values_to = "activity") %>%
  dplyr::mutate(
    sig = factor(sig, levels = paste0("Sig", 1:11))
  )

Plot with function in sigminer.

show_group_distribution(
  df_abs,
  gvar = "sig",
  dvar = "activity",
  order_by_fun = FALSE,
  g_angle = 90,
  ylab = "log10(activity+1)"
)

show_group_distribution(
  df_rel,
  gvar = "sig",
  dvar = "activity",
  order_by_fun = FALSE,
  g_angle = 90,
  ylab = "relative activity"
)

Copy number profile and signature activity

Copy number profile for representative samples

Here we will draw copy number profile of 2 representative samples for each signature.

get_sample_from_sig <- function(dt, sig) {
  res <- head(dt[order(dt[[sig]], decreasing = TRUE)], 2L)
  res
}

Read CopyNumber object for PCAWG data.

pcawg_cn_obj <- readRDS("../data/pcawg_cn_obj.rds")
samp_summary <- pcawg_cn_obj@summary.per.sample
rel_activity <- get_sig_exposure(pcawg_sigs, type = "relative", rel_threshold = 0)
rel_activity <- rel_activity[, lapply(.SD, function(x) {
  if (is.numeric(x)) round(x, 3) else x
})]
dir.create("../output/enrich_samples/", showWarnings = FALSE)
for (i in paste0("Sig", 1:11)) {
  cat(paste0("Most enriched in ", i, "\n"))
  s <- get_sample_from_sig(rel_activity, i)
  print(s)
  plist <- show_cn_profile(pcawg_cn_obj,
    samples = s$sample,
    show_title = TRUE,
    return_plotlist = TRUE
  )
  plist <- purrr::map2(plist, s$sample, function(p, x) {
    s <- samp_summary[sample == x]
    text <- paste0(
      "n_of_seg:", s$n_of_seg, "\n",
      "n_of_amp:", s$n_of_amp, "\n",
      "n_of_del:", s$n_of_del, "\n"
    )
    p <- p + annotate("text",
      x = Inf, y = Inf, hjust = 1, vjust = 1,
      label = text
    )
    p
  })
  p <- cowplot::plot_grid(plotlist = plist, nrow = 2)
  print(p)
  ggplot2::ggsave(file.path("../output/enrich_samples/", paste0(i, ".pdf")),
    plot = p, width = 12, height = 6
  )
}
Most enriched in Sig1
     sample  Sig1  Sig2  Sig3  Sig4 Sig5 Sig6  Sig7  Sig8  Sig9 Sig10 Sig11
1: SP105253 0.717 0.045 0.000 0.001    0 0.03 0.183 0.000 0.003 0.003 0.018
2:  SP98941 0.682 0.200 0.002 0.000    0 0.00 0.002 0.099 0.000 0.009 0.005

Most enriched in Sig2
    sample  Sig1  Sig2  Sig3  Sig4  Sig5  Sig6  Sig7  Sig8  Sig9 Sig10 Sig11
1: SP55235 0.056 0.796 0.000 0.006 0.007 0.031 0.010 0.066 0.000     0 0.027
2: SP56303 0.019 0.779 0.004 0.000 0.000 0.102 0.064 0.026 0.005     0 0.001

Most enriched in Sig3
     sample Sig1 Sig2  Sig3  Sig4  Sig5 Sig6  Sig7 Sig8  Sig9 Sig10 Sig11
1: SP112911    0    0 0.988 0.002 0.006    0 0.000    0 0.004     0     0
2: SP112817    0    0 0.959 0.006 0.013    0 0.005    0 0.016     0     0

Most enriched in Sig4
     sample  Sig1  Sig2  Sig3  Sig4  Sig5 Sig6  Sig7  Sig8  Sig9 Sig10 Sig11
1: SP114903 0.144 0.000 0.005 0.615 0.187    0 0.008 0.003 0.030 0.008 0.000
2: SP114898 0.166 0.001 0.005 0.594 0.194    0 0.006 0.002 0.022 0.006 0.002

Most enriched in Sig5
     sample Sig1 Sig2 Sig3 Sig4 Sig5 Sig6 Sig7 Sig8 Sig9 Sig10 Sig11
1: SP105708    0    0    0    0    1    0    0    0    0     0     0
2: SP107359    0    0    0    0    1    0    0    0    0     0     0

Most enriched in Sig6
     sample  Sig1  Sig2  Sig3  Sig4  Sig5  Sig6  Sig7  Sig8 Sig9 Sig10 Sig11
1: SP116525 0.033 0.021 0.001 0.002 0.000 0.905 0.004 0.028    0 0.003 0.004
2: SP121824 0.000 0.000 0.000 0.013 0.028 0.800 0.001 0.000    0 0.002 0.156

Most enriched in Sig7
     sample  Sig1  Sig2 Sig3 Sig4 Sig5  Sig6  Sig7  Sig8  Sig9 Sig10 Sig11
1: SP116345 0.006 0.349    0    0    0 0.003 0.591 0.051 0.000 0.000     0
2: SP124197 0.069 0.104    0    0    0 0.000 0.591 0.156 0.028 0.053     0

Most enriched in Sig8
    sample  Sig1 Sig2 Sig3 Sig4 Sig5 Sig6 Sig7  Sig8 Sig9 Sig10 Sig11
1:  SP5381 0.000    0    0    0    0    0    0 1.000    0     0     0
2: SP78664 0.014    0    0    0    0    0    0 0.986    0     0     0

Most enriched in Sig9
     sample Sig1 Sig2  Sig3  Sig4  Sig5 Sig6 Sig7 Sig8  Sig9 Sig10 Sig11
1: SP106797    0    0 0.073 0.029 0.136    0    0    0 0.762     0     0
2: SP135354    0    0 0.043 0.000 0.251    0    0    0 0.705     0     0

Most enriched in Sig10
     sample  Sig1 Sig2 Sig3  Sig4  Sig5 Sig6  Sig7  Sig8  Sig9 Sig10 Sig11
1: SP117086 0.000    0    0 0.000 0.038    0 0.065 0.094 0.043 0.760     0
2: SP117223 0.071    0    0 0.007 0.197    0 0.000 0.015 0.022 0.688     0

Most enriched in Sig11
     sample  Sig1  Sig2  Sig3  Sig4  Sig5  Sig6  Sig7  Sig8  Sig9 Sig10 Sig11
1:  SP11235 0.047 0.030 0.051 0.140 0.046 0.002 0.051 0.028 0.056 0.065 0.485
2: SP112194 0.073 0.128 0.003 0.032 0.031 0.054 0.032 0.091 0.012 0.060 0.484

Ideograph for copy number profile reconstruction

TODO: select one/two samples and draw copy number profile -> catalog profile -> signature relative activity. Similar to Nature paper.

PART 3: Pan-Cancer signature analysis

In this part, we will analyze copy number signatures across cancer types and show the landscape.

Signature number and contribution in each cancer type

Load tidy cancer type annotation data.

library(sigminer)
pcawg_types <- readRDS("../data/pcawg_type_info.rds")

To better describe the copy number signature landscape, here we use refitting activity data obtained from bootstrap procedure (see section “Reliable signature activity attribution” in PART 1).

pcawg_activity <- readRDS("../data/pcawg_cn_sigs_CN176_activity.rds")

Combine the cancer type annotation and activity data and only keep samples with good reconstruction (>0.75 cosine similarity).

keep_samps <- pcawg_activity$similarity >= 0.75

df_abs <- merge(pcawg_activity$abs_activity[keep_samps], pcawg_types, by = "sample")
df_rel <- merge(pcawg_activity$rel_activity[keep_samps], pcawg_types, by = "sample")

Signature activity in each cancer type

Here we draw distribution of a signature across cancer types.

show_group_distribution(
  df_abs,
  gvar = "cancer_type",
  dvar = "Sig1",
  order_by_fun = FALSE,
  g_angle = 90,
  point_size = 0.3
)

We have many signatures here, so we output them to PDF files.

dir.create("../output/cancer-type-dist", showWarnings = F)
signames <- paste0("Sig", 1:11)
for (i in signames) {
  pxx <- show_group_distribution(df_abs,
    gvar = "cancer_type",
    dvar = i, order_by_fun = FALSE,
    ylab = i,
    g_angle = 90, point_size = 0.3
  )
  ggplot2::ggsave(file.path("../output/cancer-type-dist/", paste0("Absolute_activity_", i, ".pdf")),
    plot = pxx, width = 12, height = 6
  )
  pxx <- show_group_distribution(df_rel,
    gvar = "cancer_type",
    dvar = i, order_by_fun = FALSE,
    ylab = i,
    g_angle = 90, point_size = 0.3
  )
  ggplot2::ggsave(file.path("../output/cancer-type-dist/", paste0("Relative_activity_", i, ".pdf")),
    plot = pxx, width = 12, height = 6
  )
}
rm(pxx)

Signature landscape

Define a signature which is detectable if this signature contribute >5% exposures in a sample.

df <- df_rel %>%
  dplyr::mutate_at(dplyr::vars(dplyr::starts_with("Sig")), ~ ifelse(. > 0.5, 1L, 0L)) %>%
  tidyr::pivot_longer(
    cols = dplyr::starts_with("Sig"),
    names_to = "sig", values_to = "detectable"
  )

df2 <- df_rel %>%
  tidyr::pivot_longer(
    cols = dplyr::starts_with("Sig"),
    names_to = "sig", values_to = "expo"
  )
df <- dplyr::left_join(df, df2,
  by = c("sample", "cancer_type", "sig")
)

df_type <- df %>%
  dplyr::group_by(cancer_type, sig) %>%
  dplyr::summarise(
    freq = sum(detectable), # directly use count
    expo = median(expo[detectable == 1]),
    n = n(),
    label = paste0(unique(cancer_type), " (n=", n, ")"),
    .groups = "drop"
  )

mps <- unique(df_type[, c("cancer_type", "label")])
mpss <- mps$label
names(mpss) <- mps$cancer_type
summary(df_type$freq)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   3.395   1.000 113.000 

Show copy number signature landscape.

library(cowplot)

p <- ggplot(
  df_type,
  aes(x = cancer_type, y = factor(sig, levels = paste0("Sig", 1:11)))
) +
  geom_point(aes(size = freq, color = expo)) +
  theme_cowplot() +
  ggpubr::rotate_x_text(60) +
  scale_x_discrete(breaks = mps$cancer_type, labels = mps$label) +
  scale_size_continuous(
    limits = c(5, 230),
    breaks = c(5, 20, 50, 100, 200)
  ) +
  scale_color_gradient2(low = "white", mid = "purple", high = "black", midpoint = 0.3) +
  labs(
    x = NULL, y = "Copy number signatures",
    color = "Median exposure\ndue to signature",
    size = "Tumors with\nthe signature"
  )
p

Cancer type associated enrichment

PART 4: Cancer subtyping and prognosis analysis

In this part, we would like to see if we can cluster all tumors based on CN signatures and use the clusters to explore tumor heterogeneity.

Clustering with signature relative activity

Heatmap for clustering and genotype/phenotype features

Exploration of patients’ prognosis difference across clusters

PART 5: Inference of copy number signature etiologies

In this part, we would like to explore CN signatures in more details with association analysis and summarize what we get to infer their etiologies.

Association analysis for copy number signatures

pcawg_expo <- readRDS("../data/pcawg_cn_sigs_CN176_activity.rds")
keep_samples <- pcawg_expo$similarity >= 0.75
pcawg_expo$abs_activity <- pcawg_expo$abs_activity[keep_samples]
pcawg_expo$rel_activity <- pcawg_expo$rel_activity[keep_samples]

Signature activity association

For absolute signature activity.

show_cor(pcawg_expo$abs_activity[, -1])

For relative signature activity.

show_cor(pcawg_expo$rel_activity[, -1])

Signature clustering

We can take a try to cluster signatures based on their relative activities.

expo_mat_rel <- pcawg_expo$rel_activity %>%
  tibble::column_to_rownames("sample") %>%
  as.matrix() %>%
  t()

hc <- hclust(dist(expo_mat_rel))
factoextra::fviz_dend(hc, k = 3, rect = T)

Association between copy number signatures (CNS) and other types of signatures (SBS, DBS, ID)

CNS and SBS

CNS and DBS

CNS and ID

Signature similarity comparison between different types of signatures

比较不同 signature type 的平均相似性

Signature association network construction and visualization

Try to learn from NC paper

Association between CNS and genotype/phenotype features

Association between CNS and patients’ prognosis

Data summary and inference of signature etiologies

Supplementary Analyses

In this part, we will add some supplementary analyses.

TCGA CNS analysis

Copy number signatures

Cancer subtyping

Prognosis

Contribution of CNS to AMP/DEL events

Exploration of differnt signature extraction methods

Similarity for same number of signatures

R Session:

Many thanks to authors and contributors of all packages used in this project.

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       CentOS Linux 7 (Core)       
 system   x86_64, linux-gnu           
 ui       X11                         
 language EN                          
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Asia/Shanghai               
 date     2020-12-16                  

─ Packages ───────────────────────────────────────────────────────────────────
 package        * version  date       lib source        
 abind            1.4-5    2016-07-21 [1] CRAN (R 4.0.2)
 assertthat       0.2.1    2019-03-21 [1] CRAN (R 4.0.2)
 backports        1.2.0    2020-11-02 [1] CRAN (R 4.0.2)
 Biobase        * 2.50.0   2020-10-27 [1] Bioconductor  
 BiocGenerics   * 0.36.0   2020-10-27 [1] Bioconductor  
 BiocManager      1.30.10  2019-11-16 [1] CRAN (R 4.0.2)
 bookdown         0.21     2020-10-13 [1] CRAN (R 4.0.2)
 broom            0.7.2    2020-10-20 [1] CRAN (R 4.0.2)
 Cairo            1.5-12.2 2020-07-07 [1] CRAN (R 4.0.2)
 callr            3.5.1    2020-10-13 [1] CRAN (R 4.0.2)
 car              3.0-10   2020-09-29 [1] CRAN (R 4.0.2)
 carData          3.0-4    2020-05-22 [1] CRAN (R 4.0.2)
 cellranger       1.1.0    2016-07-27 [1] CRAN (R 4.0.2)
 circlize         0.4.11   2020-10-31 [1] CRAN (R 4.0.2)
 cli              2.2.0    2020-11-20 [1] CRAN (R 4.0.2)
 clue             0.3-58   2020-12-03 [1] CRAN (R 4.0.2)
 cluster          2.1.0    2019-06-19 [1] CRAN (R 4.0.2)
 codetools        0.2-18   2020-11-04 [1] CRAN (R 4.0.2)
 colorspace       2.0-0    2020-11-11 [1] CRAN (R 4.0.2)
 ComplexHeatmap   2.6.2    2020-11-12 [1] Bioconductor  
 cowplot        * 1.1.0    2020-09-08 [1] CRAN (R 4.0.2)
 crayon           1.3.4    2017-09-16 [1] CRAN (R 4.0.2)
 curl             4.3      2019-12-02 [1] CRAN (R 4.0.2)
 data.table       1.13.2   2020-10-19 [1] CRAN (R 4.0.2)
 DBI              1.1.0    2019-12-15 [1] CRAN (R 4.0.2)
 dbplyr           2.0.0    2020-11-03 [1] CRAN (R 4.0.2)
 dendextend       1.14.0   2020-08-26 [1] CRAN (R 4.0.2)
 desc             1.2.0    2018-05-01 [1] CRAN (R 4.0.2)
 devtools         2.3.2    2020-09-18 [1] CRAN (R 4.0.2)
 digest           0.6.27   2020-10-24 [1] CRAN (R 4.0.2)
 doParallel       1.0.16   2020-10-16 [2] CRAN (R 4.0.2)
 dplyr          * 1.0.2    2020-08-18 [1] CRAN (R 4.0.2)
 ellipsis         0.3.1    2020-05-15 [1] CRAN (R 4.0.2)
 evaluate         0.14     2019-05-28 [1] CRAN (R 4.0.2)
 factoextra       1.0.7    2020-04-01 [1] CRAN (R 4.0.2)
 fansi            0.4.1    2020-01-08 [1] CRAN (R 4.0.2)
 farver           2.0.3    2020-01-16 [1] CRAN (R 4.0.2)
 forcats        * 0.5.0    2020-03-01 [1] CRAN (R 4.0.2)
 foreach          1.5.1    2020-08-12 [2] local         
 foreign          0.8-80   2020-05-24 [1] CRAN (R 4.0.2)
 fs               1.5.0    2020-07-31 [1] CRAN (R 4.0.2)
 furrr            0.2.1    2020-10-21 [1] CRAN (R 4.0.2)
 future           1.20.1   2020-11-03 [1] CRAN (R 4.0.2)
 generics         0.1.0    2020-10-31 [1] CRAN (R 4.0.2)
 GetoptLong       1.0.4    2020-10-19 [1] CRAN (R 4.0.2)
 ggcorrplot       0.1.3    2019-05-19 [1] CRAN (R 4.0.2)
 ggplot2        * 3.3.2    2020-06-19 [1] CRAN (R 4.0.2)
 ggplotify        0.0.5    2020-03-12 [1] CRAN (R 4.0.2)
 ggpubr           0.4.0    2020-06-27 [1] CRAN (R 4.0.2)
 ggrepel          0.8.2    2020-03-08 [1] CRAN (R 4.0.2)
 ggsignif         0.6.0    2019-08-08 [1] CRAN (R 4.0.2)
 GlobalOptions    0.1.2    2020-06-10 [1] CRAN (R 4.0.2)
 globals          0.14.0   2020-11-22 [1] CRAN (R 4.0.2)
 glue             1.4.2    2020-08-27 [1] CRAN (R 4.0.2)
 gridBase         0.4-7    2014-02-24 [1] CRAN (R 4.0.2)
 gridExtra        2.3      2017-09-09 [1] CRAN (R 4.0.2)
 gridGraphics     0.5-0    2020-02-25 [1] CRAN (R 4.0.2)
 gtable           0.3.0    2019-03-25 [1] CRAN (R 4.0.2)
 haven            2.3.1    2020-06-01 [1] CRAN (R 4.0.2)
 hms              0.5.3    2020-01-08 [1] CRAN (R 4.0.2)
 htmltools        0.5.0    2020-06-16 [1] CRAN (R 4.0.2)
 httr             1.4.2    2020-07-20 [1] CRAN (R 4.0.2)
 IRanges          2.24.0   2020-10-27 [1] Bioconductor  
 iterators        1.0.13   2020-10-15 [1] CRAN (R 4.0.2)
 jsonlite         1.7.1    2020-09-07 [1] CRAN (R 4.0.2)
 knitr          * 1.30     2020-09-22 [1] CRAN (R 4.0.2)
 labeling         0.4.2    2020-10-20 [1] CRAN (R 4.0.2)
 lifecycle        0.2.0    2020-03-06 [1] CRAN (R 4.0.2)
 listenv          0.8.0    2019-12-05 [1] CRAN (R 4.0.2)
 lubridate        1.7.9.2  2020-11-13 [1] CRAN (R 4.0.2)
 magrittr         2.0.1    2020-11-17 [1] CRAN (R 4.0.2)
 matrixStats      0.57.0   2020-09-25 [1] CRAN (R 4.0.2)
 memoise          1.1.0    2017-04-21 [1] CRAN (R 4.0.2)
 modelr           0.1.8    2020-05-19 [1] CRAN (R 4.0.2)
 munsell          0.5.0    2018-06-12 [1] CRAN (R 4.0.2)
 NMF              0.23.0   2020-08-01 [1] CRAN (R 4.0.2)
 openxlsx         4.2.3    2020-10-27 [1] CRAN (R 4.0.2)
 parallelly       1.21.0   2020-10-27 [1] CRAN (R 4.0.2)
 pheatmap         1.0.12   2019-01-04 [1] CRAN (R 4.0.2)
 pillar           1.4.7    2020-11-20 [1] CRAN (R 4.0.2)
 pkgbuild         1.1.0    2020-07-13 [1] CRAN (R 4.0.2)
 pkgconfig        2.0.3    2019-09-22 [1] CRAN (R 4.0.2)
 pkgload          1.1.0    2020-05-29 [1] CRAN (R 4.0.2)
 pkgmaker         0.32.2   2020-10-20 [1] CRAN (R 4.0.2)
 plyr             1.8.6    2020-03-03 [1] CRAN (R 4.0.2)
 png              0.1-7    2013-12-03 [2] CRAN (R 4.0.2)
 prettyunits      1.1.1    2020-01-24 [1] CRAN (R 4.0.2)
 processx         3.4.5    2020-11-30 [1] CRAN (R 4.0.2)
 ps               1.5.0    2020-12-05 [1] CRAN (R 4.0.2)
 purrr          * 0.3.4    2020-04-17 [1] CRAN (R 4.0.2)
 R.cache          0.14.0   2019-12-06 [1] CRAN (R 4.0.2)
 R.methodsS3      1.8.1    2020-08-26 [1] CRAN (R 4.0.2)
 R.oo             1.24.0   2020-08-26 [1] CRAN (R 4.0.2)
 R.utils          2.10.1   2020-08-26 [1] CRAN (R 4.0.2)
 R6               2.5.0    2020-10-28 [1] CRAN (R 4.0.2)
 RColorBrewer     1.1-2    2014-12-07 [1] CRAN (R 4.0.2)
 Rcpp             1.0.5    2020-07-06 [2] CRAN (R 4.0.2)
 readr          * 1.4.0    2020-10-05 [1] CRAN (R 4.0.2)
 readxl           1.3.1    2019-03-13 [1] CRAN (R 4.0.2)
 registry         0.5-1    2019-03-05 [1] CRAN (R 4.0.2)
 rematch2         2.1.2    2020-05-01 [1] CRAN (R 4.0.2)
 remotes          2.2.0    2020-07-21 [1] CRAN (R 4.0.2)
 reprex           0.3.0    2019-05-16 [1] CRAN (R 4.0.2)
 reshape2         1.4.4    2020-04-09 [1] CRAN (R 4.0.2)
 RevoUtils      * 11.0.2   2020-08-12 [2] local         
 RevoUtilsMath  * 11.0.0   2020-08-12 [2] local         
 rio              0.5.16   2018-11-26 [1] CRAN (R 4.0.2)
 rjson            0.2.20   2018-06-08 [1] CRAN (R 4.0.2)
 rlang            0.4.9    2020-11-26 [1] CRAN (R 4.0.2)
 rmarkdown        2.5      2020-10-21 [1] CRAN (R 4.0.2)
 rmdformats     * 1.0.0    2020-11-23 [1] CRAN (R 4.0.2)
 rngtools         1.5      2020-01-23 [1] CRAN (R 4.0.2)
 rprojroot        2.0.2    2020-11-15 [1] CRAN (R 4.0.2)
 rstatix          0.6.0    2020-06-18 [1] CRAN (R 4.0.2)
 rstudioapi       0.13     2020-11-12 [1] CRAN (R 4.0.2)
 rvcheck          0.1.8    2020-03-01 [1] CRAN (R 4.0.2)
 rvest            0.3.6    2020-07-25 [1] CRAN (R 4.0.2)
 S4Vectors        0.28.0   2020-10-27 [1] Bioconductor  
 scales           1.1.1    2020-05-11 [1] CRAN (R 4.0.2)
 sessioninfo      1.1.1    2018-11-05 [1] CRAN (R 4.0.2)
 shape            1.4.5    2020-09-13 [1] CRAN (R 4.0.2)
 sigminer       * 1.1.0.99 2020-12-16 [1] local         
 stringi          1.5.3    2020-09-09 [1] CRAN (R 4.0.2)
 stringr        * 1.4.0    2019-02-10 [1] CRAN (R 4.0.2)
 styler           1.3.2    2020-02-23 [1] CRAN (R 4.0.2)
 testthat         3.0.0    2020-10-31 [1] CRAN (R 4.0.2)
 tibble         * 3.0.4    2020-10-12 [1] CRAN (R 4.0.2)
 tidyr          * 1.1.2    2020-08-27 [1] CRAN (R 4.0.2)
 tidyselect       1.1.0    2020-05-11 [1] CRAN (R 4.0.2)
 tidyverse      * 1.3.0    2019-11-21 [1] CRAN (R 4.0.2)
 usethis          1.6.3    2020-09-17 [1] CRAN (R 4.0.2)
 vctrs            0.3.5    2020-11-17 [1] CRAN (R 4.0.2)
 viridis          0.5.1    2018-03-29 [1] CRAN (R 4.0.2)
 viridisLite      0.3.0    2018-02-01 [1] CRAN (R 4.0.2)
 withr            2.3.0    2020-09-22 [1] CRAN (R 4.0.2)
 xfun             0.19     2020-10-30 [1] CRAN (R 4.0.2)
 xml2             1.3.2    2020-04-23 [1] CRAN (R 4.0.2)
 xtable           1.8-4    2019-04-21 [1] CRAN (R 4.0.2)
 yaml             2.2.1    2020-02-01 [1] CRAN (R 4.0.2)
 zip              2.1.1    2020-08-27 [1] CRAN (R 4.0.2)

[1] /home/public/R/library
[2] /opt/microsoft/ropen/4.0.2/lib64/R/library

References

Degasperi, Andrea, Tauanne Dias Amarante, Jan Czarnecki, Scott Shooter, Xueqing Zou, Dominik Glodzik, Sandro Morganella, et al. 2020. “A Practical Framework and Online Tool for Mutational Signature Analyses Show Intertissue Variation and Driver Dependencies.” Nature Cancer 1 (2): 249–63. https://doi.org/10.1038/s43018-020-0027-5.
Huang, Xiaoqing, Damian Wojtowicz, and Teresa M Przytycka. 2018. “Detecting Presence of Mutational Signatures in Cancer with Confidence.” Edited by Christina Curtis. Bioinformatics 34 (2): 330–37. https://doi.org/10.1093/bioinformatics/btx604.
Maura, Francesco, Andrea Degasperi, Ferran Nadeu, Daniel Leongamornlert, Helen Davies, Luiza Moore, Romina Royo, et al. 2019. “A Practical Guide for Mutational Signature Analysis in Hematological Malignancies.” Nature Communications 10 (1): 2969. https://doi.org/10.1038/s41467-019-11037-8.
PCAWG Mutational Signatures Working Group, PCAWG Consortium, Ludmil B. Alexandrov, Jaegil Kim, Nicholas J. Haradhvala, Mi Ni Huang, Alvin Wei Tian Ng, et al. 2020. “The Repertoire of Mutational Signatures in Human Cancer.” Nature 578 (7793): 94–101. https://doi.org/10.1038/s41586-020-1943-3.